Credit Card User Churn Prediction¶

Problem Statement¶

Business Context¶

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas

You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards

Data Description¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio

What Is a Revolving Balance?¶

  • If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance
What is the Average Open to buy?¶
  • 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.
What is the Average utilization Ratio?¶
  • The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:¶
  • ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1

Please read the instructions carefully before starting the project.¶

This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.

  • Blanks '___' are provided in the notebook that needs to be filled with an appropriate code to get the correct result. With every '___' blank, there is a comment that briefly describes what needs to be filled in the blank space.
  • Identify the task to be performed correctly, and only then proceed to write the required code.
  • Fill the code wherever asked by the commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw error.
  • Please run the codes in a sequential manner from the beginning to avoid any unnecessary errors.
  • Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.

Importing necessary libraries¶

In [ ]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
#!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
In [ ]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imblearn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [ ]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

from imblearn.over_sampling import SMOTE
# To supress warnings
import warnings
warnings.filterwarnings("ignore")
In [ ]:
#mounting the googledrive to see the dataset file saved on google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Loading the dataset¶

In [ ]:
#Accessing google drive to read dataset
churn = pd.read_csv('/content/drive/My Drive/BankChurners.csv')

Data Overview¶

The initial steps to get an overview of any dataset is to:

observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not get information about the number of rows and columns in the dataset find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected. check the statistical summary of the dataset to get an overview of the numerical columns of the data.

Checking the shape of the dataset¶

In [ ]:
# Checking the number of rows and columns in the training data
churn.shape
Out[ ]:
(10127, 21)

Observation - Dataset has 10127 rows & 21 columns.

In [ ]:
# Creating a copy of the data to another variable to avoid any changes to original data
data = churn.copy()

Displaying the first few rows of the dataset¶

In [ ]:
# Viewing the first 5 rows of the dataset
data.head(5)
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000

Observation - Shows top 5 rows of the data set.

In [ ]:
# Viewing the last 5 rows of the dataset
data.tail(5)
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 3 2 3 4003.000 1851 2152.000 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.000 0 5409.000 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 6 2 4 10388.000 1961 8427.000 0.703 10294 61 0.649 0.189

Observation - Shows last 5 rows of the data set.

Checking the data types of the columns for the dataset¶

In [ ]:
# checking the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

Observations -

  • There are 21 columns & 10,127 entries in the dataset
  • 5 columns are of type float64, 10 columns are of type int64, and 6 columns are of type object.
  • Education_Level (1,519 missing entries) & Marital_Status (749 missing entries) columns are having missing data. All other columns have 10,127 non-null values, indicating no missing data in these columns.

Checking for duplicate values (Sanity Check)¶

In [ ]:
# Checking duplicate values in the dataset
data.duplicated().sum()
Out[ ]:
0

Observation- Data doesnt have any duplicate values.

Checking for missing values (Sanity Check)¶

In [ ]:
# Check for missing values in the data -
data.isnull().sum()
Out[ ]:
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Observation -

  • Education_Level column shows 1519 missing values.
  • Marital_Status column shows 749 missing values.
  • These missing values needs to be imputed after the data is split into training, validation, and test data sets to prevent data leakage.

Statistical summary of the dataset¶

In [ ]:
#Checking statistical summary of the numerical columns in the data
data. describe().T
Out[ ]:
count mean std min 25% 50% 75% max
CLIENTNUM 10127.000 739177606.334 36903783.450 708082083.000 713036770.500 717926358.000 773143533.000 828343083.000
Customer_Age 10127.000 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Dependent_count 10127.000 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Months_on_book 10127.000 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 0.275 0.276 0.000 0.023 0.176 0.503 0.999

Observations -

CLIENTNUM - This column can be dropped,as this is unique ID column for all customers and will not add up anything for further analysis.

Customer_Age - Average customers are 46 years, a min is 26, and a max is 73 years.

Dependent_count - Average custmers has 2 dependents, There are no minimum dependents, and a max of 5 dependents.

Months_on_book AVergae customers on books are 35.9 months, a min of 13, and a max of 56 months.

Total_Relationship_Count Average customer relationshio count is almost4 products with the bank, a min of 1, and a max of 6 products with the bank.

Months_Inactive_12_mon Average Inactive months are 2.3 months, a min of 0, and a max of 6 months.

Contacts_Count_12_mon Average customers been contacted of 2.4 times , a min of 0, and a max of 6 contacts.

Credit_Limit Average Credit limit is 8632 dollars, a min of 1438, and a max of 34516 dollars (rounded to nearest dollar). This is a very large range.

Total_Revolving_Bal Average revolving bal is 1163 dollars, a min of 0, and a max of 2517.

Avg_Open_To_Buy Average Open to Buy Credit Line is 7469 dollars, a min of 3, and a max of 34516 dollar. This is a very large range.

Total_Amt_Chng_Q4_Q1 Average amount changed is 0.76, a min of 0, and a max of 3.397. This is a ratio of amount spent in Q4 to amount spend in Q1 (Q4/Q1).

Total_Trans_Amt Average transaction 4404 dollars, a min of 510, and a max of 18484 dollars.

Total_Trans_Ct Average 64.8 are total transactions, min of 10, and a max of 139 total transactions.

Total_Ct_Chng_Q4_Q1 total change in transation count is 0.71, a min of 0, and a max of 3.71. This is a ratio of number of transactions in Q4 to number of transactions in Q1 (Q4/Q1).

Avg_Utilization_Ratio Average card utilization ratio is 27.5%, a min of 0%, and a max of 99.9%. This is the customers percent of credit used.

In [ ]:
#Observing Object /categories from the dataset
data.describe(include=["object"]).T
Out[ ]:
count unique top freq
Attrition_Flag 10127 2 Existing Customer 8500
Gender 10127 2 F 5358
Education_Level 8608 6 Graduate 3128
Marital_Status 9378 3 Married 4687
Income_Category 10127 6 Less than $40K 3561
Card_Category 10127 4 Blue 9436

Observations

Attrition_Flag has 10127 non-null entries and 2 unique entries, with the most frequent being "Existing Customer".

Gender has 10127 non-null entries and 2 unique entries, with the most frequent being "F".

Education_Level has 8608 non-null entries and 6 unique entries, with the most frequent being "Graduate".

Null values are present and will be imputed after data is split into training, validation, and test sets to avoid data leakage.

Marital_Status has 9369 non-null entries and 3 unique entries, with the most frequent being "Married".

Null values are present and will be imputed after data is split into training, validation, and test sets to avoid data leakage.

Income_Category has 10127 non-null entries and 6 unique entries, with the most frequent being "Less than 40k"

Card_Category has 10127 non-null entries and 4 unique entries, with the most frequent being "Blue".

Unique Value Analysis of each Object column¶

In [ ]:
# Understanding Unique values for each Object column
for i in data.describe(include=["object"]).columns:
    print("Unique values in", i, "are :")
    print(data[i].value_counts())
    print("*" * 50)
Unique values in Attrition_Flag are :
Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64
**************************************************
Unique values in Gender are :
Gender
F    5358
M    4769
Name: count, dtype: int64
**************************************************
Unique values in Education_Level are :
Education_Level
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: count, dtype: int64
**************************************************
Unique values in Marital_Status are :
Marital_Status
Married     4687
Single      3943
Divorced     748
Name: count, dtype: int64
**************************************************
Unique values in Income_Category are :
Income_Category
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: count, dtype: int64
**************************************************
Unique values in Card_Category are :
Card_Category
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: count, dtype: int64
**************************************************

Observations Income_Category shows 1112 values as abc, which are needed to be treated with further analysis.

Data Cleaning(Data drop)¶

In [ ]:
# CLIENTNUM consists of unique IDs of clients and hence, will not add any value to the modeling so dropping it.
data.drop(["CLIENTNUM"], axis=1, inplace=True)

Data Pre-processing¶

In [ ]:
## Encoding Existing and Attrited customers to 0 and 1 respectively, for analysis.
#Encoding "Attrition_Flag" data to 0 & 1, where 0 representing "Existing Customer" and 1 representing "Attrited Customer" for further analysis or modeling purposes.

data["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
data["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)
In [ ]:
#Verifying Top 5 rows of the new data frame after replacement
data.head()
Out[ ]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 0 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 0 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 0 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 0 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 0 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000

Observation Attrition_Flag column shows Zero values instead of 'Existing customer' or'Atrited customer'

In [ ]:
# Checking the values of the Income_Category column.
data['Income_Category'].value_counts()
Out[ ]:
Income_Category
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: count, dtype: int64
In [ ]:
# Replacing "abc" entries in the Income_Category column with np.nan.
data['Income_Category'].replace('abc', np.nan, inplace=True)
In [ ]:
# Checking the new values of the Income_Category column.
data['Income_Category'].value_counts()
Out[ ]:
Income_Category
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
$120K +            727
Name: count, dtype: int64
In [ ]:
# Observing the amount of non-null values in the Income_Category column.
data['Income_Category'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 10127 entries, 0 to 10126
Series name: Income_Category
Non-Null Count  Dtype 
--------------  ----- 
9015 non-null   object
dtypes: object(1)
memory usage: 79.2+ KB

Exploratory Data Analysis¶

The below functions need to be defined to carry out the Exploratory Data Analysis.¶

In [ ]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
# function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [ ]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate analysis¶

Customer_Age

In [ ]:
histogram_boxplot(data, "Customer_Age", kde=True)

Observations

*Average customer age is 46 yr. data is normally distributed.

*Plot shows there are few outliers to the right for this variable.

Months_on_book

In [ ]:
histogram_boxplot (data,"Months_on_book", kde=True)

Observations

*'Months on Book' data looks normally distributed, with a high frequency of Mode.

*Data shows outliers,meaning possibly there are incorrect values.

*The avergae months on the book is 35 months.

Credit_Limit

In [ ]:
histogram_boxplot(data,"Credit_Limit", kde=True)

Observations

*Credit limit data is highly right skewed. The data shows too many outliers,means need further analysis.It seems like these values are just outside the range, but may be the credit limits.

*The average credit limit is approx 8500.

*Only one data point shows high credit limit i.e 35000, which need to be treated.

Total_Revolving_Bal

In [ ]:
histogram_boxplot(data,"Total_Revolving_Bal",kde=True)

Observations

*Max customers shows zero 'Total Revolving Bal',this data is slighly left skewed.

*Average 'Total_Revolving_bal' is around 1250 with the mean being slightly lower.

*2500 is max Total revolving balance.

Avg_Open_To_Buy

In [ ]:
histogram_boxplot(data,"Avg_Open_To_Buy",kde=True)

Observations -

*Avg open to buy column data is right skewed.

*There are many outliers observed in the data to the right side.

Total_Trans_Ct

In [ ]:
histogram_boxplot(data,"Total_Trans_Ct",kde=True)

Observations -

*Total Trans Ct column data is fairly distributed.

*There are negligible outliers observed on right.

Total_Amt_Chng_Q4_Q1

In [ ]:
histogram_boxplot(data,"Total_Amt_Chng_Q4_Q1",kde=True)

Observations -

*Total amt chng Q4_Q1 ' data has lot of outliers, which needs to be treated.

*data is fairly distributed with more outliers on the right side compared to left side.

Let's see total transaction amount distributed

Total_Trans_Amt

In [ ]:
histogram_boxplot(data,"Total_Trans_Amt",kde=True)

Observations -

*Total Trans Amt data has lots of outliers on right, means this column needs to be observed closly & treat the values.

*Data is uneually distributed & shows slightly right skewed.

*No customer shows 0 Trans Amount means all customers used their credit cards.

Total_Ct_Chng_Q4_Q1

In [ ]:
histogram_boxplot(data,"Total_Ct_Chng_Q4_Q1",kde=True)

Observations -

*"Total_Ct_Chng_Q4_Q1" column has many outliers, specially on right side compared to left side, which needs treatment.

*Data distribution seems normal.

Avg_Utilization_Ratio

In [ ]:
histogram_boxplot(data,"Avg_Utilization_Ratio",kde=True)

Observations -

*Avg_Utilization_Ratio Data is highly right skewed.

*Observation - Most of the customers have low Avg Utilization ratio meaning (how much available credit customers spent)

Dependent_count

In [ ]:
labeled_barplot(data, "Dependent_count")

Observation

*Customers with Dependent count as 3 shows the most credit card spending.

*Customers with dependent count 5 show lowest credit card spending.

*Average customer spending is beyond 1500.

Total_Relationship_Count

In [ ]:
labeled_barplot(data,"Total_Relationship_Count")

Observations -

*More number of Customers (2305) shows 3 yrs of Total relationship with Terabank.

*There very less customers shows 1 yr of total relationship with Bank.

*On an Avergae most number of customers have more than 3 years of total relationship with bank.

Months_Inactive_12_mon

In [ ]:
labeled_barplot(data,"Months_Inactive_12_mon")

Observation -

*Maximum number of customers are inactive since last 3 months

3282 customers are 2 months inactive. 2233 customers show 1 months inactive.

Contacts_Count_12_mon

In [ ]:
labeled_barplot(data,'Contacts_Count_12_mon')

Observations

*33% of customers have been contacted 3 times in the last 12 months.

*31% of customers have been contacted 2 times in the last 12 months.

*14% of customers have been contacted 1 times in the last 12 months.

Gender

In [ ]:
labeled_barplot(data,'Gender')

Observations - There are more number of female (5358) customers than Male (4769) customers

Let's see the distribution of the level of education of customers

Education_Level

In [ ]:
labeled_barplot(data,'Education_Level')

Observations -

Education column Data shows , mix of all level education of customers holding credit card. The Maximum number of customers are Grduates (3128), There are significant ammount of High school customers (2013.0) in the dataset. The least number of customers(451.0) are Doctorate.There are 516 customers which are Post-Graduates. There are significat customers(1487.0) which are uneducated. Inference - based on these observations, we must stidy High school student catagory & Uneducated customer catagory as high risk potential for credit card Churn.

Marital_Status

In [ ]:
labeled_barplot(data,'Marital_Status')

Observation - Maximum customers are Married (4687.0), however significant cusomers holding credit cards are single too (3943). There are less number of Divorced customers (748.0)

Let's see the distribution of the level of income of customers

Income_Category

In [ ]:
labeled_barplot(data,'Income_Category')

Observations - The maximum number of customers(3561.0) have income bracket of Less than 40K. , however there are some customers with income beyond $120 K. The 'abc' bar in the dataset reflects that there may be some anamolous data in this column,which needs further analysis.

Card_Category

In [ ]:
labeled_barplot(data,'Card_Category')

Observations - Maximum customers have 'Blue' credit card category. Negligible customers have Platinum credit cards, which can be removed from the dataset as those customers doesnt add much value to remaining data points. There is major difference between Blue cc holders & Silver cc holders(555.0). There are very less customers(116.0) who has Gold CCs.

Attrition_Flag

In [ ]:
labeled_barplot(data,'Attrition_Flag')

Observations - Attriction_Flag, Zero reflects existing customers.means they are still customers with Terabank & not attrited.

*There are total 1627 customers who attrited recently.

*This is the target variable.

In [ ]:
# creating histograms
data.hist(figsize=(14, 14))
plt.show()

Observations Generated all histplots at once, to explore overall data distribution for each column in the dataset. Further will Adjust visualization parameters as needed, based on the characteristics of the data and analysis further objectives.

Bivariate Distributions¶

Attrition_Flag vs Gender

In [ ]:
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag     0     1    All
Gender                           
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------

Observations Plot shows there is a negligible difference between Male / Female Gender in terms of Attrition flag. meaning both genders customer Attrited almost equally. Gender column doesnt show much impact on Attrition Flag.

Attrition_Flag vs Marital_Status

In [ ]:
stacked_barplot(data, "Marital_Status","Attrition_Flag")
Attrition_Flag     0     1   All
Marital_Status                  
All             7880  1498  9378
Married         3978   709  4687
Single          3275   668  3943
Divorced         627   121   748
------------------------------------------------------------------------------------------------------------------------

Observations Marital status also doesnt show any major impact on Attrition flag.

Education_Level Vs Attrition_Flag

In [ ]:
stacked_barplot(data, 'Education_Level', 'Attrition_Flag')
Attrition_Flag      0     1   All
Education_Level                  
All              7237  1371  8608
Graduate         2641   487  3128
High School      1707   306  2013
Uneducated       1250   237  1487
College           859   154  1013
Doctorate         356    95   451
Post-Graduate     424    92   516
------------------------------------------------------------------------------------------------------------------------

Observations Education shows not major impact on Attrition rate.

Attrition_Flag vs Income_Category

In [ ]:
stacked_barplot(data,"Income_Category", "Attrition_Flag")
Attrition_Flag      0     1   All
Income_Category                  
All              7575  1440  9015
Less than $40K   2949   612  3561
$40K - $60K      1519   271  1790
$80K - $120K     1293   242  1535
$60K - $80K      1213   189  1402
$120K +           601   126   727
------------------------------------------------------------------------------------------------------------------------

Observations - Income_category doesnt show any significat impact on Attrition rate.

Attrition_Flag vs Contacts_Count_12_mon

In [ ]:
stacked_barplot(data,"Contacts_Count_12_mon","Attrition_Flag")
Attrition_Flag            0     1    All
Contacts_Count_12_mon                   
All                    8500  1627  10127
3                      2699   681   3380
2                      2824   403   3227
4                      1077   315   1392
1                      1391   108   1499
5                       117    59    176
6                         0    54     54
0                       392     7    399
------------------------------------------------------------------------------------------------------------------------

observations

Customers with less contacts in the last 12 months attrited more often.

Let's see the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)

Attrition_Flag vs Months_Inactive_12_mon

In [ ]:
stacked_barplot(data,"Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag             0     1    All
Months_Inactive_12_mon                   
All                     8500  1627  10127
3                       3020   826   3846
2                       2777   505   3282
4                        305   130    435
1                       2133   100   2233
5                        146    32    178
6                        105    19    124
0                         14    15     29
------------------------------------------------------------------------------------------------------------------------

Observations Months_Inactice_12_mon does have some affect on attrition.

Attrition_Flag vs Total_Relationship_Count

In [ ]:
stacked_barplot(data,"Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag               0     1    All
Total_Relationship_Count                   
All                       8500  1627  10127
3                         1905   400   2305
2                          897   346   1243
1                          677   233    910
5                         1664   227   1891
4                         1687   225   1912
6                         1670   196   1866
------------------------------------------------------------------------------------------------------------------------

Observations Customers that have 1 or 2 products with the bank attrit the most, followed by customers who have 3 products. Customers that have either 4, 5, or 6 products with the bank attrit at nearly the same rates.

Attrition_Flag vs Dependent_count

In [ ]:
stacked_barplot(data,"Dependent_count", "Attrition_Flag")
Attrition_Flag      0     1    All
Dependent_count                   
All              8500  1627  10127
3                2250   482   2732
2                2238   417   2655
1                1569   269   1838
4                1314   260   1574
0                 769   135    904
5                 360    64    424
------------------------------------------------------------------------------------------------------------------------

Observation From this stacked barplot, Dependent_Count does not show much effect on attrition.

EDA Observations (Questions & Answers)¶

Questions:

1. How is the total transaction amount distributed?

Total Trans Amt data has lots of outliers on right, means this column needs to be observed closly & treat the values.Data for this column is uneually distributed & shows slightly right skewness. No customer shows 0 Trans Amount means all customers used their credit cards.The Total_Trans_Amt distribution appears consistent across both existing and attrited customers. *Attrited customers have a median Total_Trans_Amt of 2500 Dollars, whereas existing customers show a higher median, nearing 4000 Dollars. Notably, the Interquartile Range (IQR) of Total_Trans_Amt for attrited customers is considerably narrower compared to existing customers. Additionally, the maximum Total_Trans_Amt for attrited customers is roughly half that of existing customers.

2. What is the distribution of the level of education of customers? Education column Data shows , mix of all level education of customers holding credit card. The Maximum number of customers are Grduates (3128), There are significant ammount of High school customers (2013.0) in the dataset. The least number of customers(451.0) are Doctorate.There are 516 customers which are Post-Graduates. *There are significat customers(1487.0) which are uneducated.Based on these observations, we must study High school student catagory & Uneducated customer catagory as high risk potential for credit card Churn.

3.What is the distribution of the level of income of customers? The maximum number of customers(3561.0) have income bracket of Less than 40K, however there are some customers with income beyond $120 K. 39% of customers make less than 40k. 19% of customers make between 40k - 60k. 17% of customers make between 80k - 120k.

4. How does the change in transaction amount between Q4 and Q1(total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)? total change in transation count is 0.71, a min of 0, and a max of 3.71. This is a ratio of number of transactions in Q4 to number of transactions in Q1 (Q4/Q1). Distribution of Total_Ct_Chng_Q4_Q1 is centered around 0.5 for attrited customers, while its centered around 0.7 for existing customers. Median of Total_Ct_Chng_Q4_Q1 for existing customers is greater than that of 75% of attrited customers. Min of Total_Ct_Chng_Q4_Q1 for existing customer much greater than that of attrited customers. Max looks same for both existing & attrited customers.Total_Ct_Chng_Q4_Q1 for both attrited and existing customers are normally distributed.

5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)? Average Inactive months are 2.3 months, a min of 0, and a max of 6 months. Maximum number of customers are inactive since last 3 months 3282 customers are 2 months inactive. 2233 customers show 1 months inactive. 33% of customers have been contacted 3 times in the last 12 months. 31% of customers have been contacted 2 times in the last 12 months. 14% of customers have been contacted 1 times in the last 12 months.

6. What are the attributes that have a strong correlation with each other?

Avg_Open_to_Buy and Credit_Limit are completely positively correlated by necessity. As a customer's credit limit goes up, their open to buy also increases.Total_Trans_Amt and Total_Trans_Ct are very highly positively correlated. This is natural as the more number of transations a customer makes, the more money customer will spend. Customer_Age and Months_on_book are highly positively correlated, as customer Age increases, their time with the bank increases.Total_Revolving balance and Avg_Utilization_Ratio is positively correlated. This makes sense because if a customer has a high utilization, they will likely have a higher revolving balance. Avg_Open_To_Buy and Avg_Utilization_Ratio are negatively correlated as higher the customers utilization is, the less their amount open to buy will be.Credit_Limit and Avg_Utilization_Ratio are negatively correlated,as customers with a higher credit limit tend to have a lower utilization.

Total_Revolving_Bal vs `Attrition_Flag'

In [ ]:
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")

Observations -

Total_Revolving_Bal shows similar distributions for both attrited and existing customers. The existing customers have a bulge in the center. Attrited customers have peaks at both the min and max of the distribution. The median Total_Revolving_Bal for existing customers is higher than that of more than the attrited customers.

Attrition_Flag vs Credit_Limit

In [ ]:
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")

Observations Credit_Limit plots shows almost identical distribution for existing customer and attrited customers.

Attrition_Flag vs Customer_Age

In [ ]:
distribution_plot_wrt_target(data, "Customer_Age", "Attrition_Flag")

Observations Customer_Age plot shows almost identical distribution for existing customer and attrited customers.

In [ ]:
distribution_plot_wrt_target(data, "Months_Inactive_12_mon", "Attrition_Flag")

Observations Average Inactive months are 2 months, a min of 0, and a max of 6 months. Maximum number of customers are inactive since last 3 months.

Total_Trans_Ct vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")

Observations Total_Trans_Ct plots shows normally distributed for attrited customers.

Attrited customers have a much lower median and max Total_Trans_Ct than existing customers. The distribution of Total_Trans_Ct for attrited customers is centered around 50 while for existing customers its around 70.

Total_Trans_Amt vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")

Observations The distribution of Total_Trans_Amt looks similar for both existing and attrited customers. The median Total_Trans_Amt for attrited customers is 2500, while the median for existing customers is closer to 4000. The IQR of Total_Trans_Amt for attrited customers is much smaller than that of existing customers. The maximum Total_Trans_Amt for attrited customers is about half as much compared to exsiting customers.

Let's see the change in transaction amount between Q4 and Q1 (total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)

Total_Ct_Chng_Q4_Q1 vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")

Observations Total_Ct_Chng_Q4_Q1 for both attrited and existing customers are normally distributed. Distribution of Total_Ct_Chng_Q4_Q1 is centered around 0.5 for attrited customers, while its centered around 0.7 for existing customers. Median of Total_Ct_Chng_Q4_Q1 for existing customers is greater than that of 75% of attrited customers. Min of Total_Ct_Chng_Q4_Q1 for existing customer much greater than that of attrited customers. Max looks same for both existing & attrited customers.

Avg_Utilization_Ratio vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")

Observations The median Avg_Utilization_Ratio for attrited customers is 20% & its 0% for existing customers. Close to 75% of existing customers have an Avg_Utilization_Ratio less than the median of attrited customers.

Attrition_Flag vs Months_on_book

In [ ]:
distribution_plot_wrt_target(data, "Months_on_book", "Attrition_Flag")

Observations Month on Books data is normally distributed for existing customer and attrited customers.

Attrition_Flag vs Avg_Open_To_Buy

In [ ]:
distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")

Observations Avg_Open_To_Buy is identically distributed for existing customer and attrited customers.

Let's see the attributes that have a strong correlation with each other

Correlation Check

In [ ]:
#Creating correlation matrix to show any correlation between variables
plt.figure(figsize=(15, 7))
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
sns.heatmap(data[numeric_columns].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Observations *Values of 1 are highly positively correlated, values of -1 are highly negatively correlated.

*Avg_Open_to_Buy and Credit_Limit are completely positively correlated by necessity. As a customer's credit limit goes up, their open to buy also increases.

*Total_Trans_Amt and Total_Trans_Ct are very highly positively correlated. This is natural as the more number of transations a customer makes, the more money customer will spend.

*Customer_Age and Months_on_book are highly positively correlated, as customer Age increases, their time with the bank increases.

*Total_Revolving balance and Avg_Utilization_Ratio is positively correlated. This makes sense because if a customer has a high utilization, they will likely have a higher revolving balance.

*Avg_Open_To_Buy and Avg_Utilization_Ratio are negatively correlated as higher the customers utilization is, the less their amount open to buy will be.

*Credit_Limit and Avg_Utilization_Ratio are negatively correlated,as customers with a higher credit limit tend to have a lower utilization.

Data Preprocessing¶

Outlier Detection¶

In [ ]:
Q1 = data.quantile(0.25,numeric_only=True)  # To find the 25th percentile
Q3 = data.quantile(0.75, numeric_only=True)  # To find the 75th percentile

IQR = Q3 - Q1  # Inter Quantile Range (75th perentile - 25th percentile)

# Finding lower and upper bounds for all values. All values outside these bounds are outliers
lower = (Q1 - 1.5 * IQR)
upper = (Q3 + 1.5 * IQR)
In [ ]:
# checking the % outliers
((data.select_dtypes(include=["float64", "int64"]) < lower) | (data.select_dtypes(include=["float64", "int64"]) > upper)).sum() / len(data) * 100
Out[ ]:
Attrition_Flag             16.066
Customer_Age                0.020
Dependent_count             0.000
Months_on_book              3.812
Total_Relationship_Count    0.000
Months_Inactive_12_mon      3.268
Contacts_Count_12_mon       6.211
Credit_Limit                9.717
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             9.509
Total_Amt_Chng_Q4_Q1        3.910
Total_Trans_Amt             8.848
Total_Trans_Ct              0.020
Total_Ct_Chng_Q4_Q1         3.891
Avg_Utilization_Ratio       0.000
dtype: float64

Train-Test Split¶

In [ ]:
# creating the copy of the dataframe
data1 = data.copy()
In [ ]:
data1.isna().sum()
Out[ ]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category             1112
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Observation - to avoid Data leakage, all the null values Education_Level(1519), Marital_Status(749),Income_Category(1112) in columns above will be imputed after train-test data split

In [ ]:
# Creating a list with column labels that need to be converted from "object" to "category" data type.
cat_cols = [
    'Attrition_Flag',
    'Gender',
    'Education_Level',
    'Marital_Status',
    'Card_Category',
    'Income_Category'
]

# Converting the columns with "object" data type to "category" data type.
data[cat_cols] = data[cat_cols].astype('category')

Observation Converted columns with data type of "object" to "category" for further use in model building & analysis.

In [ ]:
# Verifying conversion of 'Object'  to 'category' the data types of the new data frame.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64   
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           9015 non-null   category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  int64   
 10  Months_Inactive_12_mon    10127 non-null  int64   
 11  Contacts_Count_12_mon     10127 non-null  int64   
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB

Observations Observed ''Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status', 'Card_Category', 'Income_Category' got converted to Category dtypes

In [ ]:
# Dividing train data into X and y

X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"]

Splitting Data into Train-Test¶

In [ ]:
# Splitting data into training, validation and test set
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)

Missing value imputation¶

In [ ]:
# creating an instace of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")
In [ ]:
reqd_col_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]
In [ ]:
# Fitting and transforming the train data to impute missing values in X_train set
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])

# Fitting & Transforming the validation data to impute missing values in X_val set
X_val[reqd_col_for_impute]  =  imputer.fit_transform(X_val[reqd_col_for_impute])

# Fitting & Transform the test data to impute missing values in X_test set
X_test[reqd_col_for_impute] =  imputer.fit_transform(X_test[reqd_col_for_impute])
In [ ]:
# Printing no column is showing missing values in train, val, test set.
print(X_train.isna().sum())
print("*" * 40)
print(X_val.isna().sum())
print("*" * 40)
print(X_test.isna().sum())
print("*" * 40)
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
****************************************
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
****************************************
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
****************************************
In [ ]:
# Printing & verifying the size & percentages of classes of the Training, Validation, and Test data frames, after missing value imputation.
print("*"*40)
print("Shape of Training Set : ", X_train.shape)
print("Shape of Validation Set", X_val.shape)
print("Shape of Test Set : ", X_test.shape)
print("*"*40)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("*"*40)
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print("*"*40)
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
print("*"*40)
****************************************
Shape of Training Set :  (6075, 19)
Shape of Validation Set (2026, 19)
Shape of Test Set :  (2026, 19)
****************************************
Percentage of classes in training set:
Attrition_Flag
0   0.839
1   0.161
Name: proportion, dtype: float64
****************************************
Percentage of classes in validation set:
Attrition_Flag
0   0.839
1   0.161
Name: proportion, dtype: float64
****************************************
Percentage of classes in test set:
Attrition_Flag
0   0.840
1   0.160
Name: proportion, dtype: float64
****************************************

Observations *Splitted data successfully into training, validation, and test sets.

*All Models will be trained on training data, and evaluated on validation data.

*The best models will be tuned and finally evaluated on the test data.(prod data)

In [ ]:
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("*" * 40)
Gender
F    3193
M    2882
Name: count, dtype: int64
****************************************
Education_Level
Graduate         2782
High School      1228
Uneducated        881
College           618
Post-Graduate     312
Doctorate         254
Name: count, dtype: int64
****************************************
Marital_Status
Married     3276
Single      2369
Divorced     430
Name: count, dtype: int64
****************************************
Income_Category
Less than $40K    2783
$40K - $60K       1059
$80K - $120K       953
$60K - $80K        831
$120K +            449
Name: count, dtype: int64
****************************************
Card_Category
Blue        5655
Silver       339
Gold          69
Platinum      12
Name: count, dtype: int64
****************************************
In [ ]:
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("*" * 40)
Gender
F    1095
M     931
Name: count, dtype: int64
****************************************
Education_Level
Graduate         917
High School      404
Uneducated       306
College          199
Post-Graduate    101
Doctorate         99
Name: count, dtype: int64
****************************************
Marital_Status
Married     1100
Single       770
Divorced     156
Name: count, dtype: int64
****************************************
Income_Category
Less than $40K    957
$40K - $60K       361
$80K - $120K      293
$60K - $80K       279
$120K +           136
Name: count, dtype: int64
****************************************
Card_Category
Blue        1905
Silver        97
Gold          21
Platinum       3
Name: count, dtype: int64
****************************************
In [ ]:
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 40)
Gender
F    1070
M     956
Name: count, dtype: int64
****************************************
Education_Level
Graduate         948
High School      381
Uneducated       300
College          196
Post-Graduate    103
Doctorate         98
Name: count, dtype: int64
****************************************
Marital_Status
Married     1060
Single       804
Divorced     162
Name: count, dtype: int64
****************************************
Income_Category
Less than $40K    933
$40K - $60K       370
$60K - $80K       292
$80K - $120K      289
$120K +           142
Name: count, dtype: int64
****************************************
Card_Category
Blue        1876
Silver       119
Gold          26
Platinum       5
Name: count, dtype: int64
****************************************

Encoding categorical variables¶

In [ ]:
#using drop first to avoid multicolonarity, Dropping first of each encoded column to reduce data frame size.
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val,drop_first=True)
X_test = pd.get_dummies(X_test,drop_first=True)
# Printing shape of new dataframe
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 29) (2026, 29) (2026, 29)

Observations

*Encoded categorical columns for model building.

*Dropped 1 dummy variable column from each category as its not needed to have all columns.

*After encoding there are 29 columns (including dummies)

In [ ]:
# Checking information of new train data frame columns (29)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6075 entries, 800 to 4035
Data columns (total 29 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Customer_Age                    6075 non-null   int64  
 1   Dependent_count                 6075 non-null   int64  
 2   Months_on_book                  6075 non-null   int64  
 3   Total_Relationship_Count        6075 non-null   int64  
 4   Months_Inactive_12_mon          6075 non-null   int64  
 5   Contacts_Count_12_mon           6075 non-null   int64  
 6   Credit_Limit                    6075 non-null   float64
 7   Total_Revolving_Bal             6075 non-null   int64  
 8   Avg_Open_To_Buy                 6075 non-null   float64
 9   Total_Amt_Chng_Q4_Q1            6075 non-null   float64
 10  Total_Trans_Amt                 6075 non-null   int64  
 11  Total_Trans_Ct                  6075 non-null   int64  
 12  Total_Ct_Chng_Q4_Q1             6075 non-null   float64
 13  Avg_Utilization_Ratio           6075 non-null   float64
 14  Gender_M                        6075 non-null   bool   
 15  Education_Level_Doctorate       6075 non-null   bool   
 16  Education_Level_Graduate        6075 non-null   bool   
 17  Education_Level_High School     6075 non-null   bool   
 18  Education_Level_Post-Graduate   6075 non-null   bool   
 19  Education_Level_Uneducated      6075 non-null   bool   
 20  Marital_Status_Married          6075 non-null   bool   
 21  Marital_Status_Single           6075 non-null   bool   
 22  Income_Category_$40K - $60K     6075 non-null   bool   
 23  Income_Category_$60K - $80K     6075 non-null   bool   
 24  Income_Category_$80K - $120K    6075 non-null   bool   
 25  Income_Category_Less than $40K  6075 non-null   bool   
 26  Card_Category_Gold              6075 non-null   bool   
 27  Card_Category_Platinum          6075 non-null   bool   
 28  Card_Category_Silver            6075 non-null   bool   
dtypes: bool(15), float64(5), int64(9)
memory usage: 800.9 KB
In [ ]:
# Checking information of new validation set data frame's columns.
X_val.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2026 entries, 2894 to 6319
Data columns (total 29 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Customer_Age                    2026 non-null   int64  
 1   Dependent_count                 2026 non-null   int64  
 2   Months_on_book                  2026 non-null   int64  
 3   Total_Relationship_Count        2026 non-null   int64  
 4   Months_Inactive_12_mon          2026 non-null   int64  
 5   Contacts_Count_12_mon           2026 non-null   int64  
 6   Credit_Limit                    2026 non-null   float64
 7   Total_Revolving_Bal             2026 non-null   int64  
 8   Avg_Open_To_Buy                 2026 non-null   float64
 9   Total_Amt_Chng_Q4_Q1            2026 non-null   float64
 10  Total_Trans_Amt                 2026 non-null   int64  
 11  Total_Trans_Ct                  2026 non-null   int64  
 12  Total_Ct_Chng_Q4_Q1             2026 non-null   float64
 13  Avg_Utilization_Ratio           2026 non-null   float64
 14  Gender_M                        2026 non-null   bool   
 15  Education_Level_Doctorate       2026 non-null   bool   
 16  Education_Level_Graduate        2026 non-null   bool   
 17  Education_Level_High School     2026 non-null   bool   
 18  Education_Level_Post-Graduate   2026 non-null   bool   
 19  Education_Level_Uneducated      2026 non-null   bool   
 20  Marital_Status_Married          2026 non-null   bool   
 21  Marital_Status_Single           2026 non-null   bool   
 22  Income_Category_$40K - $60K     2026 non-null   bool   
 23  Income_Category_$60K - $80K     2026 non-null   bool   
 24  Income_Category_$80K - $120K    2026 non-null   bool   
 25  Income_Category_Less than $40K  2026 non-null   bool   
 26  Card_Category_Gold              2026 non-null   bool   
 27  Card_Category_Platinum          2026 non-null   bool   
 28  Card_Category_Silver            2026 non-null   bool   
dtypes: bool(15), float64(5), int64(9)
memory usage: 267.1 KB
In [ ]:
# Checking information of new test data frame's columns.
X_test.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2026 entries, 9760 to 413
Data columns (total 29 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Customer_Age                    2026 non-null   int64  
 1   Dependent_count                 2026 non-null   int64  
 2   Months_on_book                  2026 non-null   int64  
 3   Total_Relationship_Count        2026 non-null   int64  
 4   Months_Inactive_12_mon          2026 non-null   int64  
 5   Contacts_Count_12_mon           2026 non-null   int64  
 6   Credit_Limit                    2026 non-null   float64
 7   Total_Revolving_Bal             2026 non-null   int64  
 8   Avg_Open_To_Buy                 2026 non-null   float64
 9   Total_Amt_Chng_Q4_Q1            2026 non-null   float64
 10  Total_Trans_Amt                 2026 non-null   int64  
 11  Total_Trans_Ct                  2026 non-null   int64  
 12  Total_Ct_Chng_Q4_Q1             2026 non-null   float64
 13  Avg_Utilization_Ratio           2026 non-null   float64
 14  Gender_M                        2026 non-null   bool   
 15  Education_Level_Doctorate       2026 non-null   bool   
 16  Education_Level_Graduate        2026 non-null   bool   
 17  Education_Level_High School     2026 non-null   bool   
 18  Education_Level_Post-Graduate   2026 non-null   bool   
 19  Education_Level_Uneducated      2026 non-null   bool   
 20  Marital_Status_Married          2026 non-null   bool   
 21  Marital_Status_Single           2026 non-null   bool   
 22  Income_Category_$40K - $60K     2026 non-null   bool   
 23  Income_Category_$60K - $80K     2026 non-null   bool   
 24  Income_Category_$80K - $120K    2026 non-null   bool   
 25  Income_Category_Less than $40K  2026 non-null   bool   
 26  Card_Category_Gold              2026 non-null   bool   
 27  Card_Category_Platinum          2026 non-null   bool   
 28  Card_Category_Silver            2026 non-null   bool   
dtypes: bool(15), float64(5), int64(9)
memory usage: 267.1 KB

Model Building¶

Model evaluation criterion¶

Model can make wrong predictions as:

  • Predicting a customer will attrite but the customer doesn't attrite (False Positive)
  • Predicting a customer will not attrite but the customer attrites (False Negative)

Which case is more important?

  • Predicting that customer will not attrite but if customer attrites. This means missing the chance to retain a valuable customer or asset.

How to reduce this loss i.e need to reduce False Negatives??

-The bank should prioritize maximizing Recall, as a higher Recall reduces the likelihood of false negatives. This means focusing on increasing Recall or minimizing false negatives, effectively identifying true positives (i.e., Class 1). By accurately identifying customers at risk of attrition, the bank can better retain its valuable customers.

Let's define a function to output different metrics (including recall) on the train and test data sets and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
# Defining a function to create a confusion matrix to check TP, FP, TN, adn FN values.
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building - Original Data¶

In [ ]:
 # Getting Recall scores for 6 models that were fit on orginal training data.
 # Appending all the models into the list
models = []
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1)))  # Append Gradient Boosting
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))  # Append AdaBoost
models.append(("Decisiontree",DecisionTreeClassifier(random_state=1))) # Append DecisionTree
models.append(("XGB", XGBClassifier(random_state=1)))#Append XG Boost
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.985655737704918
Random forest: 1.0
Gradient Boost: 0.875
AdaBoost: 0.826844262295082
Decisiontree: 1.0
XGB: 1.0

Validation Performance:

Bagging: 0.8128834355828221
Random forest: 0.7975460122699386
Gradient Boost: 0.8558282208588958
AdaBoost: 0.852760736196319
Decisiontree: 0.8159509202453987
XGB: 0.901840490797546
In [ ]:
#Getting Training and Validation Performance Difference

print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
    # Fit the model on the training data
    model.fit(X_train, y_train)
    # Calculate recall scores for training and validation sets
    scores_train = recall_score(y_train, model.predict(X_train))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference1 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference1))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9857, Validation Score: 0.8129, Difference: 0.1728
Random forest: Training Score: 1.0000, Validation Score: 0.7975, Difference: 0.2025
Gradient Boost: Training Score: 0.8750, Validation Score: 0.8558, Difference: 0.0192
AdaBoost: Training Score: 0.8268, Validation Score: 0.8528, Difference: -0.0259
Decisiontree: Training Score: 1.0000, Validation Score: 0.8160, Difference: 0.1840
XGB: Training Score: 1.0000, Validation Score: 0.9018, Difference: 0.0982

Observations Model Building - Original Data

The top 3 models based on the Validation Recall scores and performance difference are:

  1. XGBoost (XGB) 2.Gradient Boost 3.AdaBoost

These models have the highest validation scores, and their performance differences indicate a balance between fitting and generalizing well on unseen data.

Model Building - Oversampled Data¶

In [ ]:
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)


print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099 

After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099 

After Oversampling, the shape of train_X: (10198, 29)
After Oversampling, the shape of train_y: (10198,) 

In [ ]:
 # Getting Recall scores for 6 models that were fit on oversampled data.
 # Appending all the models into the list
models = [] # Empty list to store all the models
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1)))  # Append Gradient Boosting
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))  # Append AdaBoost
models.append(("Decisiontree",DecisionTreeClassifier(random_state=1))) # Append DecisionTree
models.append(("XGB", XGBClassifier(random_state=1)))#Append XG Boost
print("\n" "Training Performance:" "\n")
for name, model in models:
    # Fit the model on the training data
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9976465973720338
Random forest: 1.0
Gradient Boost: 0.9792116101196313
AdaBoost: 0.964698960580506
Decisiontree: 1.0
XGB: 1.0

Validation Performance:

Bagging: 0.8619631901840491
Random forest: 0.8619631901840491
Gradient Boost: 0.9049079754601227
AdaBoost: 0.901840490797546
Decisiontree: 0.8650306748466258
XGB: 0.9294478527607362
In [ ]:
#Getting Training and Validation Performance Difference on oversampled data
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores_train = recall_score(y_train_over, model.predict(X_train_over))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference2 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference2))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9976, Validation Score: 0.8620, Difference: 0.1357
Random forest: Training Score: 1.0000, Validation Score: 0.8620, Difference: 0.1380
Gradient Boost: Training Score: 0.9792, Validation Score: 0.9049, Difference: 0.0743
AdaBoost: Training Score: 0.9647, Validation Score: 0.9018, Difference: 0.0629
Decisiontree: Training Score: 1.0000, Validation Score: 0.8650, Difference: 0.1350
XGB: Training Score: 1.0000, Validation Score: 0.9294, Difference: 0.0706

Observations Model Building - Oversampled Data

The top 3 models based on the Validation Recall scores and the differences are:

1.XGBoost (XGB)

2.Gradient Boost

3.AdaBoost

These models have the highest validation scores and relatively low differences between training and validation scores, indicating a good balance between fitting the training data and generalizing well to unseen data.

Model Building - Undersampled Data¶

In [ ]:
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [ ]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099 

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976 

After Under Sampling, the shape of train_X: (1952, 29)
After Under Sampling, the shape of train_y: (1952,) 

In [ ]:
 # Getting Recall scores for 6 models that were fit on undersampled data.
 # Appending all the models into the list
models = []
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1)))  # Append Gradient Boosting
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))  # Append AdaBoost
models.append(("Decisiontree",DecisionTreeClassifier(random_state=1))) # Append DecisionTree
models.append(("XGB", XGBClassifier(random_state=1)))#Append XG Boost
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9907786885245902
Random forest: 1.0
Gradient Boost: 0.9805327868852459
AdaBoost: 0.9528688524590164
Decisiontree: 1.0
XGB: 1.0

Validation Performance:

Bagging: 0.9294478527607362
Random forest: 0.9386503067484663
Gradient Boost: 0.9570552147239264
AdaBoost: 0.9601226993865031
Decisiontree: 0.9202453987730062
XGB: 0.9693251533742331
In [ ]:
#Getting Training and Validation Performance Difference on undersampled data
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores_train = recall_score(y_train_un, model.predict(X_train_un))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference3 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference3))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9908, Validation Score: 0.9294, Difference: 0.0613
Random forest: Training Score: 1.0000, Validation Score: 0.9387, Difference: 0.0613
Gradient Boost: Training Score: 0.9805, Validation Score: 0.9571, Difference: 0.0235
AdaBoost: Training Score: 0.9529, Validation Score: 0.9601, Difference: -0.0073
Decisiontree: Training Score: 1.0000, Validation Score: 0.9202, Difference: 0.0798
XGB: Training Score: 1.0000, Validation Score: 0.9693, Difference: 0.0307

Observations

The top 3 models based on the Validation Recall scores and diffrences are:

1.XGBoost (XGB)

2.AdaBoost

3.Gradient Boost

These models have the highest validation scores and the smallest differences between training and validation scores, indicating a strong balance between fitting the training data and generalizing well to unseen data.

Hyperparameter Tuning¶

Note¶

  1. Sample parameter grids have been provided to do necessary hyperparameter tuning. These sample grids are expected to provide a balance between model performance improvement and execution time. One can extend/reduce the parameter grid based on execution time and system configuration.
    • Please note that if the parameter grid is extended to improve the model performance further, the execution time will increase
  2. The models chosen in this notebook are based on test runs. One can update the best models as obtained upon code execution and tune them for best performance.

Tuning AdaBoost using original data¶

In [ ]:
#Getting Best Parameters,CV score using RandomizedSearchCV
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)## to fit the model on original data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8360596546310832:
CPU times: user 3.98 s, sys: 270 ms, total: 4.25 s
Wall time: 1min 47s
In [ ]:
# Creating new pipeline with best parameters
tuned_adb_orig = AdaBoostClassifier(random_state=1,
    n_estimators= 100 , learning_rate=0.1 , base_estimator= DecisionTreeClassifier(max_depth=3, random_state=1))

tuned_adb_orig.fit(X_train, y_train)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
DecisionTreeClassifier(max_depth=3, random_state=1)
DecisionTreeClassifier(max_depth=3, random_state=1)
In [ ]:
adb_train_orig = model_performance_classification_sklearn(tuned_adb_orig,X_train ,y_train)
adb_train_orig
Out[ ]:
Accuracy Recall Precision F1
0 0.982 0.927 0.961 0.944
In [ ]:
# Saving the tuned model's scores for later comparison.
adb_train_orig_score = model_performance_classification_sklearn(tuned_adb_orig, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the original training data.
confusion_matrix_sklearn(tuned_adb_orig, X_train, y_train)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - AdaBoost Original(train data)")
plt.show()
In [ ]:
# Checking model's performance on validation set
adb_val_orig =  model_performance_classification_sklearn(tuned_adb_orig, X_val, y_val)
adb_val_orig
Out[ ]:
Accuracy Recall Precision F1
0 0.967 0.856 0.933 0.893
In [ ]:
# Saving the tuned model's scores for later comparison.
adb_val_orig_score = model_performance_classification_sklearn(tuned_adb_orig, X_val, y_val)
# Creating the confusion matrix for the tuned model's performance on the original validation data.
confusion_matrix_sklearn(tuned_adb_orig, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - AdaBoost Original (Validation data)")
plt.show()

Tuning Ada Boost using undersampled data¶

In [ ]:
%%time

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9467346938775512:
CPU times: user 1.84 s, sys: 119 ms, total: 1.96 s
Wall time: 41.1 s
In [ ]:
# Creating new pipeline with best parameters
tuned_ada_un = AdaBoostClassifier( random_state=1,
    n_estimators=100, learning_rate= 0.05, base_estimator= DecisionTreeClassifier(max_depth=3, random_state=1)
)

tuned_ada_un.fit(X_train_un, y_train_un)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.05, n_estimators=100, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.05, n_estimators=100, random_state=1)
DecisionTreeClassifier(max_depth=3, random_state=1)
DecisionTreeClassifier(max_depth=3, random_state=1)
In [ ]:
adb_train_un = model_performance_classification_sklearn(tuned_ada_un, X_train_un, y_train_un)
adb_train_un
Out[ ]:
Accuracy Recall Precision F1
0 0.973 0.978 0.968 0.973
In [ ]:
#Saving the tuned model's scores for later comparison.
adb_train_un_score = model_performance_classification_sklearn(tuned_ada_un, X_train_un, y_train_un)
# Creating the confusion matrix for the tuned model's performance on the Undersample training data.
confusion_matrix_sklearn (tuned_ada_un, X_train_un, y_train_un)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - AdaBoost Undersample(train data)")
plt.show()
In [ ]:
# Checking model's performance on validation set
adb_val_un = model_performance_classification_sklearn(tuned_ada_un, X_val, y_val)
adb_val_un
Out[ ]:
Accuracy Recall Precision F1
0 0.937 0.966 0.731 0.832
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the Undersample validation data.
confusion_matrix_sklearn(tuned_ada_un, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - AdaBoost Undersample(Validation data)")
plt.show()

Tuning AdaBoost using Oversampled data¶

In [ ]:
%%time

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9515668956493293:
CPU times: user 6.72 s, sys: 338 ms, total: 7.06 s
Wall time: 2min 52s
In [ ]:
# Creating new pipeline with best parameters
tuned_ada_over = AdaBoostClassifier( random_state=1,  n_estimators= 100, learning_rate= 0.1, base_estimator= DecisionTreeClassifier(max_depth=3, random_state=1))

tuned_ada_over.fit( X_train_over, y_train_over)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
DecisionTreeClassifier(max_depth=3, random_state=1)
DecisionTreeClassifier(max_depth=3, random_state=1)
In [ ]:
adb_train_over = model_performance_classification_sklearn(tuned_ada_over, X_train_over, y_train_over)
adb_train_over
Out[ ]:
Accuracy Recall Precision F1
0 0.985 0.985 0.985 0.985
In [ ]:
#Saving the tuned model's scores for later comparison.
adb_train_over_score = model_performance_classification_sklearn(tuned_ada_over, X_train_over, y_train_over)
# Creating the confusion matrix for the tuned model's performance on the oversampled training data.
confusion_matrix_sklearn (tuned_ada_over, X_train_over, y_train_over)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - AdaBoost Oversample(train data)")
plt.show()
In [ ]:
# Checking model's performance on validation set
adb_val_over = model_performance_classification_sklearn(tuned_ada_over, X_val, y_val)
adb_val_over
Out[ ]:
Accuracy Recall Precision F1
0 0.968 0.908 0.894 0.901
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the original validation data.
confusion_matrix_sklearn(tuned_ada_over, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - AdaBoost Original(Validation data)")
plt.show()

Tuning Gradient Boosting using original data¶

In [ ]:
%%time

#defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)


print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8104395604395604:
CPU times: user 3.9 s, sys: 412 ms, total: 4.32 s
Wall time: 2min 41s
In [ ]:
# Creating new pipeline with best parameters
tuned_gbm_orig = GradientBoostingClassifier(
    max_features=0.5,
    init=AdaBoostClassifier(random_state=1),
    random_state=1,
    learning_rate=0.1,
    n_estimators= 100,
    subsample=0.9,
)

tuned_gbm_orig.fit(X_train, y_train)
Out[ ]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
AdaBoostClassifier(random_state=1)
AdaBoostClassifier(random_state=1)
In [ ]:
gbm_train_orig =  model_performance_classification_sklearn(
    tuned_gbm_orig, X_train, y_train)

gbm_train_orig
Out[ ]:
Accuracy Recall Precision F1
0 0.972 0.867 0.955 0.909
In [ ]:
# Saving the tuned model's scores for later comparison.
gbm_train_orig_score = model_performance_classification_sklearn(tuned_gbm_orig, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the original training data.
confusion_matrix_sklearn(tuned_gbm_orig, X_train, y_train)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - GradientBoost Original(train data)")
plt.show()
In [ ]:
# Checking model's performance on validation set
gbm_val_orig = model_performance_classification_sklearn(tuned_gbm_orig, X_val, y_val)
gbm_val_orig
Out[ ]:
Accuracy Recall Precision F1
0 0.968 0.862 0.937 0.898
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the original validation data.
confusion_matrix_sklearn(tuned_gbm_orig, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - GradientBoost Original(Validation data)")
plt.show()

Tuning Gradient Boosting using undersampled data¶

In [ ]:
%%time

#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9508267922553637:
CPU times: user 2.16 s, sys: 184 ms, total: 2.35 s
Wall time: 1min 13s
In [ ]:
# Creating new pipeline with best parameters
tuned_gbm_un = GradientBoostingClassifier(
    max_features=0.7,
    init=AdaBoostClassifier(random_state=1),
    random_state=1,
    learning_rate=0.1,
    n_estimators=75,
    subsample=0.9,
)

tuned_gbm_un.fit(X_train_un, y_train_un)
Out[ ]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
AdaBoostClassifier(random_state=1)
AdaBoostClassifier(random_state=1)
In [ ]:
gbm_train_un = model_performance_classification_sklearn(tuned_gbm_un,X_train_un ,y_train_un)
gbm_train_un
Out[ ]:
Accuracy Recall Precision F1
0 0.970 0.977 0.964 0.970
In [ ]:
# Saving the tuned model's scores for later comparison.
gbm_train_un_score = model_performance_classification_sklearn(tuned_gbm_un, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the undersampled training data.
confusion_matrix_sklearn(tuned_gbm_un, X_train, y_train)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - GradientBoost Undersample(train data)")
plt.show()
In [ ]:
gbm_val_un = model_performance_classification_sklearn(tuned_gbm_un,X_val ,y_val)
gbm_val_un
Out[ ]:
Accuracy Recall Precision F1
0 0.938 0.957 0.738 0.833
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the undersample validation data.
confusion_matrix_sklearn(tuned_gbm_un, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - GradientBoost Undersample(Validation data)")
plt.show()

Tuning Gradient Boosting using over sampled data¶

In [ ]:
%%time

#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)


print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9541157228347668:
CPU times: user 5.97 s, sys: 632 ms, total: 6.6 s
Wall time: 4min 23s
In [ ]:
# Creating new pipeline with best parameters
tuned_gbm_over = GradientBoostingClassifier(
    max_features=0.5,
    init=AdaBoostClassifier(random_state=1),
    random_state=1,
    learning_rate=0.1,
    n_estimators=100,
    subsample=0.9,
)

tuned_gbm_over.fit(X_train_over, y_train_over)
Out[ ]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
AdaBoostClassifier(random_state=1)
AdaBoostClassifier(random_state=1)
In [ ]:
gbm_train_over = model_performance_classification_sklearn(tuned_gbm_over,X_train_over ,y_train_over)
gbm_train_over
Out[ ]:
Accuracy Recall Precision F1
0 0.975 0.979 0.972 0.975
In [ ]:
# Saving the tuned model's scores for later comparison.
gbm_train_over_score = model_performance_classification_sklearn(tuned_gbm_over, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the oversampled training data.
confusion_matrix_sklearn(tuned_gbm_over, X_train, y_train)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - GradientBoost Oversample(train data)")
plt.show()
In [ ]:
gbm_val_over = model_performance_classification_sklearn(tuned_gbm_over,X_val ,y_val)
gbm_val_over
Out[ ]:
Accuracy Recall Precision F1
0 0.961 0.911 0.853 0.881
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the oversampled validation data.
confusion_matrix_sklearn(tuned_gbm_over, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - GradientBoost Oversample(Validation data)")
plt.show()

Tuning XGBoost Model with Original data¶

In [ ]:
%%time

# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')

#Parameter grid to pass in RandomSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
           }
from sklearn import metrics

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 5, 'n_estimators': 100, 'learning_rate': 0.1, 'gamma': 3} with CV score=0.921098901098901:
CPU times: user 2.12 s, sys: 195 ms, total: 2.32 s
Wall time: 1min 5s
In [ ]:
tuned_xgb_orig = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=0.9,
    scale_pos_weight=5,
    n_estimators=100,
    learning_rate=0.1,
    gamma=3,
)
tuned_xgb_orig.fit(X_train, y_train)
Out[ ]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In [ ]:
xgb_train_orig = model_performance_classification_sklearn (tuned_xgb_orig,X_train ,y_train)
xgb_train_orig
Out[ ]:
Accuracy Recall Precision F1
0 0.988 1.000 0.932 0.965
In [ ]:
# Saving the tuned model's scores for later comparison.
xgb_train_orig_score = model_performance_classification_sklearn(tuned_xgb_orig, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the original training data.
confusion_matrix_sklearn(tuned_xgb_orig, X_train, y_train)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - XGBoost Original(train data)")
plt.show()
In [ ]:
xgb_val_orig = model_performance_classification_sklearn(tuned_xgb_orig,X_val ,y_val)
xgb_val_orig
Out[ ]:
Accuracy Recall Precision F1
0 0.965 0.942 0.855 0.896
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the original validation data.
confusion_matrix_sklearn(tuned_xgb_orig, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - XGBoost Original(validation data)")
plt.show()

Tuning XGBoost Model with undersampled data¶

In [ ]:
%%time

# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')

#Parameter grid to pass in RandomSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
           }
from sklearn import metrics

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.01, 'gamma': 3} with CV score=0.9979591836734695:
CPU times: user 1.61 s, sys: 109 ms, total: 1.72 s
Wall time: 38.4 s
In [ ]:
tuned_xgb_un = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=0.7,
    scale_pos_weight=5,
    n_estimators=100,
    learning_rate=0.01,
    gamma=3,
)
tuned_xgb_un.fit(X_train_un, y_train_un)
Out[ ]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In [ ]:
xgb_train_un = model_performance_classification_sklearn(tuned_xgb_un, X_train, y_train)
xgb_train_un
Out[ ]:
Accuracy Recall Precision F1
0 0.779 1.000 0.421 0.593
In [ ]:
# Saving the tuned model's scores for later comparison.
xgb_train_un_score = model_performance_classification_sklearn(tuned_xgb_un, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the undersampled training data.
confusion_matrix_sklearn(tuned_xgb_un, X_train, y_train)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - XGBoost Undersample(train data)")
plt.show()
In [ ]:
xgb_val_un = model_performance_classification_sklearn(tuned_xgb_un, X_val, y_val)
xgb_val_un
Out[ ]:
Accuracy Recall Precision F1
0 0.777 0.994 0.419 0.590
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the undersampled validation data.
confusion_matrix_sklearn(tuned_xgb_un, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - XGBoost Undersample(validation data)")
plt.show()

Tuning XGBoost Model with oversampled data¶

In [ ]:
%%time

# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')

#Parameter grid to pass in RandomSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
           }
from sklearn import metrics

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.01, 'gamma': 3} with CV score=0.9994117647058823:
CPU times: user 2.57 s, sys: 285 ms, total: 2.86 s
Wall time: 1min 25s
In [ ]:
tuned_xgb_over = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    subsample=0.7,
    scale_pos_weight=5,
    n_estimators=50,
    learning_rate=0.01,
    gamma=3,
)

tuned_xgb_over.fit(X_train_over, y_train_over)
Out[ ]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=50,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=50,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
In [ ]:
xgb_train_over = model_performance_classification_sklearn(tuned_xgb_over,X_train_over ,y_train_over)
xgb_train_over
Out[ ]:
Accuracy Recall Precision F1
0 0.792 1.000 0.706 0.828
In [ ]:
# Saving the tuned model's scores for later comparison.
xgb_train_over_score = model_performance_classification_sklearn(tuned_xgb_over, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the Oversample training data.
confusion_matrix_sklearn(tuned_xgb_over, X_train, y_train)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - XGBoost Oversample(train data)")
plt.show()
In [ ]:
xgb_val_over = model_performance_classification_sklearn(tuned_xgb_over, X_val, y_val)
xgb_val_over
Out[ ]:
Accuracy Recall Precision F1
0 0.655 1.000 0.318 0.483
In [ ]:
# Creating the confusion matrix for the tuned model's performance on the Oversampled validation data.
confusion_matrix_sklearn(tuned_xgb_over, X_val, y_val)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - XGBoost Oversample(validation data)")
plt.show()

Model Comparison and Final Model Selection¶

In [ ]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        xgb_train_orig.T,
        gbm_train_orig.T,
        adb_train_orig.T,
        xgb_train_over.T,
        gbm_train_over.T,
        adb_train_over.T,
        xgb_train_un.T,
        gbm_train_un.T,
        adb_train_un.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
      "XGBoost trained with Original data",
      "Gradient boosting trained with Original data",
      "AdaBoost trained with Original data",
      "XGBoost trained with Oversampled data",
      "Gradient boosting trained with Oversampled data",
      "AdaBoost trained with Oversampled data",
      "XGBoost boosting trained with Undersampled data",
      "Gradient trained with Undersampled data",
      "AdaBoost trained with Undersampled data"
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
XGBoost trained with Original data Gradient boosting trained with Original data AdaBoost trained with Original data XGBoost trained with Oversampled data Gradient boosting trained with Oversampled data AdaBoost trained with Oversampled data XGBoost boosting trained with Undersampled data Gradient trained with Undersampled data AdaBoost trained with Undersampled data
Accuracy 0.988 0.972 0.982 0.792 0.975 0.985 0.779 0.970 0.973
Recall 1.000 0.867 0.927 1.000 0.979 0.985 1.000 0.977 0.978
Precision 0.932 0.955 0.961 0.706 0.972 0.985 0.421 0.964 0.968
F1 0.965 0.909 0.944 0.828 0.975 0.985 0.593 0.970 0.973
In [ ]:
# validation performance comparison

models_val_comp_df = pd.concat(
    [
        xgb_val_orig.T,
        gbm_val_orig.T,
        adb_val_orig.T,
        xgb_val_over.T,
        gbm_val_over.T,
        adb_val_over.T,
        xgb_val_un.T,
        gbm_val_un.T,
        adb_val_un.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
      "XGBoost Validation with Original data",
      "Gradient boosting Validation with Original data",
      "AdaBoost Validation with Original data",
      "XGBoost Validation with Oversampled data",
      "Gradient boosting Validation with Oversampled data",
      "AdaBoost Validation with Oversampled data",
      "XGBoost boosting Validation with Undersampled data",
      "Gradient boosting with Undersampled data",
      "AdaBoost Validation with Undersampled data"
]
print("Validation performance comparison:")
models_train_comp_df
Validation performance comparison:
Out[ ]:
XGBoost Validation with Original data Gradient boosting Validation with Original data AdaBoost Validation with Original data XGBoost Validation with Oversampled data Gradient boosting Validation with Oversampled data AdaBoost Validation with Oversampled data XGBoost boosting Validation with Undersampled data Gradient boosting with Undersampled data AdaBoost Validation with Undersampled data
Accuracy 0.988 0.972 0.982 0.792 0.975 0.985 0.779 0.970 0.973
Recall 1.000 0.867 0.927 1.000 0.979 0.985 1.000 0.977 0.978
Precision 0.932 0.955 0.961 0.706 0.972 0.985 0.421 0.964 0.968
F1 0.965 0.909 0.944 0.828 0.975 0.985 0.593 0.970 0.973

Best Model Selection Analysis

1.AdaBoost (Oversampled Data):

AdaBoost with Oversampled Data maintains a high recall (0.985) with perfect precision (0.985) and high accuracy (0.985), suggesting a well-balanced model that generalizes better.

AdaBoost trained with oversampled data is also the good model It demonstrates the highest and most balanced performance across all key metrics (Accuracy, Recall, Precision, F1) on the validation dataset. This indicates strong generalization and robustness, making it the ideal candidate for further tuning.

2.AdaBoost (Undersampled Data):

High and consistent performance with Accuracy: 0.977, Recall: 0.980, Precision: 0.974, and F1: 0.977

It demonstrates high and consistent performance across all key metrics (Accuracy, Recall, Precision, F1 score) on both training and validation datasets, making it a robust and reliable choice for prediction.

3. XGBoost (Original Data) :

XGBoost with original data also performs exceptionally well, especially considering its perfect recall and high F1 score.

The drop in precision to 0.932 suggests that it might be favoring recall at the cost of precision, potentially indicating overfitting.

Model Selection Conclusion :

Considering the balance between recall and overall model performance to avoid overfitting, the best model is AdaBoost with Oversampled Data, which achieves a high recall of 0.985 while maintaining strong performance across other metrics.

Observation:

final_model = AdaBoost with oversampled data.

AdaBoost trained with oversampled data is the best model for hyperparameter tuning. It demonstrates the highest and most balanced performance across all key metrics (Accuracy, Recall, Precision, F1) on the validation dataset. This indicates strong generalization and robustness, making it the ideal candidate for further tuning.

Now we have our final model, so let's find out how our final model is performing on unseen test data.

In [ ]:
# Let's check the performance on test set
final_model = tuned_ada_over
Model_test = model_performance_classification_sklearn(final_model, X_test, y_test)
Model_test
Out[ ]:
Accuracy Recall Precision F1
0 0.967 0.929 0.873 0.900
In [ ]:
# Plot confusion matrix on test set
confusion_matrix_sklearn(final_model, X_test, y_test)
plt.xlabel("Predicted"),
plt.ylabel("Actual"),
plt.title("Confusion Matrix - AdaBoost (Test Set)")
plt.show()

Feature Importances¶

In [ ]:
# Let's identify the iportant features
feature_names = X_train.columns
importances =  final_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observations:

Top 6 important features of the data set are:

Total_Trans_Amt

Total_Trans_Ct

Total_Revolving_Bal

Total_Amt_Chng_Q4_Q1

Total_Ct_Chng_Q4_Q1

Total_Relationship_Ct

Business Insights and Conclusions¶

Business Insights¶

  • Key Attrition Drivers:

    • Total_Trans_Amt: A decline in transaction frequency often signals impending attrition. Promotions targeting high-value purchases can stimulate transactions.
    • Total_Trans_Ct: Customers with fewer transactions are more prone to attrition. Incentives can encourage card usage.

    • Total_Revolving_Bal: Extreme balances contribute to higher attrition rates. Managing these balances can mitigate risk.

    • Total_Amt_Chng_Q4_Q1: Major Change in the transaction amount for later year spending could have the impact on attrition.

    • Total_Ct_Chng_Q4_Q1: Variations in spending patterns among existing customers suggest a propensity for later-year spending.

    • Total_Relationship_Count: Customers with fewer bank products are more likely to attrit. Enhancing product offerings and investigating issues related to product usage can aid retention.

  • Additional Insights:

    • Months_Inactive_12_mon: Prolonged periods of inactivity correlate with increased attrition risk. Automated engagement initiatives can re-engage inactive customers.
    • Contacts_Count_12_mon: Extensive interactions with the bank can indicate unresolved issues. Establishing a feedback mechanism can address customer concerns promptly.
    • Avg_Utilization_Ratio: Limited product usage increases attrition risk. Diversifying product offerings can strengthen customer relationships.
  • Proactive Customer Retention:

    • Identify at-risk customers to initiate tailored retention strategies, fostering loyalty.
    • Utilize the model as a diagnostic tool to uncover key attrition drivers and guide strategic decisions.
    • Leverage insights to implement customized retention efforts, addressing specific attrition drivers and customer segments.
  • Targeted Marketing:

    • Focus on customers who demonstrate higher credit utilization and transaction rates to foster engagement and retention.
    • Strengthen the bank's relationship with customers through cross-selling additional products to enhance loyalty and reduce attrition.

Conclusions and Recommendations¶

  • Boost Engagement:

    • Promote card usage through targeted offers and rewards tailored to customer spending habits to maintain high transaction levels.
    • Implement systems to monitor and address significant changes in transaction patterns to pre-empt customer attrition.
  • Cross-Selling:

    • Encourage the adoption of multiple products to enhance customer retention using personalized product recommendations and benefits education.
  • Retention Initiatives:

    • Develop loyalty programs for customers who maintain a certain level of transaction activity or revolving balance to incentivize continued usage.
  • Proactive Monitoring:

    • Implement systems to monitor and address significant changes in transaction patterns to pre-empt customer attrition.

By focusing on enhancing customer engagement and monitoring transaction behaviors, Thera Bank can better manage customer retention and strengthen its financial performance.